A simple application of FIC to model selection.

We have recently proposed a new information-based approach to model selection, the Frequentist
Information Criterion (FIC), that reconciles information-based and frequentist inference. The purpose
of this paper is to provide a simple example of the application of this criterion and a demonstration
of the natural emergence of model complexities with both AIC-like ($N^0$) and BIC-like ($\log N$)
scaling with observation number $N$. The application developed is deliberately simplified to make the
analysis analytically tractable.

PACS numbers: 


I. INTRODUCTION 

Although the predictivity of a model is a central objective in model building in science, it is
only one of a wide range of criteria considered. We also seek models that are motivated by our
understanding of the underlying mechanisms that give rise to phenomena, and the idea of model
parsimony is often a useful guiding principle, especially in physics. In contrast to this broad
view of model selection, this paper describes the application of a theory for model selection
motivated and entirely justified by a narrow definition of model predictivity: the ability of a
model to predict a new observation generated by a stochastic process, after the model parameters
have been fit to a finite number of previous observations.


II. AN INFORMATION-BASED APPROACH. 


The model. Consider independent and identically distributed observations $X \sim p(\cdot)$. The
true probability distribution $p$, a real-valued function of the observation, is unknown and it is
this function we are attempting to approximate from a finite number of observations:
$X^N \equiv (X_1, \ldots, X_N)$. The modeled probability distribution will be written $q$. We wish
to be absolutely explicit about the model and therefore we will distinguish between $q$, which will
represent the probability distribution for any model $M_K$, the model parameter values
$\theta \in \Theta$ and a complexity index $K = \dim \theta$, which we use to denote the dimension
of model $M_K$. An important class of models is referred to as nested. Model $M_b$ is said to be a
nested model of model $M_a$ if $M_a$ is a special case of $M_b$: there exists a subset of model
$M_b$ parameter values that results in a probability distribution equal to that generated by model
$M_a$.

Information. The Shannon information is defined:

$$ h(X|\theta, M) \equiv -\log\left[ q(X|\theta, M) / \delta q \right], \qquad (1) $$

where $\delta q = \delta x^{-D}$ is a precision. In the interest of brevity, we will simply call
this quantity information. The interpretation of this equation is as follows: $h$ is the amount of
information (the number of characters in a code) required to specify $X$, given a model $M$ with
parameters $\theta$, to a precision $\delta x^{-D}$, where the units of $h$ are nats. Nats are the
unit of information corresponding to a code in which each character can assume $e$ distinct values
(base-$e$). Neither the units of information (base) nor the precision have any mathematical
significance. Changes in the former result in a scaling and changes in the latter result in an
offset. For mathematical convenience we will work in units such that $\delta q = 1$ and information
is always measured in units of nats (base-e).
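
As a concrete illustration of Eqn. 1, the following minimal sketch evaluates the information of a single scalar observation. The Gaussian form of $q$ and the function names are assumptions made purely for illustration; the text has not fixed a model at this point. The sketch also checks the two statements above: a change of base rescales the information and a change of precision shifts it.

```python
import math

def information_nats(x, mu, sigma=1.0, delta_q=1.0):
    """h = -log[q(x | mu, sigma) / delta_q] in nats, for an assumed Gaussian q."""
    q = math.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / math.sqrt(2.0 * math.pi * sigma ** 2)
    return -math.log(q / delta_q)

def nats_to_bits(h):
    """Changing the base of the code only rescales the information."""
    return h / math.log(2.0)

h = information_nats(0.3, mu=0.0)
# Changing the precision only shifts the information by a constant offset, here log(10).
offset = information_nats(0.3, mu=0.0, delta_q=10.0) - h
```

Because the offset is the same for every model, it cancels in any comparison between models, which is why the choice $\delta q = 1$ is harmless.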

The cross entropy. Information is of central importance since it is the natural measure of model
performance [10]. The average information content of observation $X$ is defined:

$$ H(\theta, K) \equiv \mathbb{E}_{X \sim p}\, h(X|\theta, K) = -\sum_i p_i \log q_i, \qquad (2) $$

where the expectation is understood to be taken with respect to the true probability distribution
$p$. In the second equality, we have written the expectation as a discrete sum to make the point
that $H$ is an entropy. $H$ is called the Shannon Cross Entropy since, while $q$ approximates
$p$, they are not equal.
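
The discrete form of Eqn. 2 is easy to evaluate directly. The sketch below uses two hypothetical three-outcome distributions (the numbers are invented for illustration) and shows that encoding with the model $q$ in place of the true distribution $p$ costs extra information on average.

```python
import math

def cross_entropy(p, q):
    """H = -sum_i p_i log q_i in nats; p is the true distribution, q the model."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0.0)

p = [0.5, 0.25, 0.25]          # hypothetical true distribution
q = [0.4, 0.4, 0.2]            # hypothetical model
entropy = cross_entropy(p, p)  # cross entropy of p with itself is the Shannon entropy
excess = cross_entropy(p, q) - entropy  # average information lost by using q in place of p
```

The excess is non-negative for any choice of $q$ and vanishes only when $q = p$.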

The determination of model parameters. The true and Maximum Likelihood Estimators (MLE) of the
model parameters are found by minimizing the cross entropy and information respectively:

$$ \theta_0 \equiv \arg\min_{\theta} H(\theta, K), \qquad (3) $$

$$ \hat{\theta}_X \equiv \arg\min_{\theta} h(X^N|\theta, K), \qquad (4) $$

where the hat denotes an MLE and the $X$ subscript reminds the reader that these parameters are a
function of the observations $X^N$. The true parameters are called true, not because they are the
parameterization of the true probability distribution $p$, but rather because these parameter
values would be those fit if an infinite number of observations could be collected.

Model selection. Model selection from an information-based perspective is performed by identifying
the model (parameterized by the MLE parameters) with the minimum cross entropy. This process has
three equivalent interpretations: (i) maximizing the predictivity of the model, (ii) minimizing the
information loss due to approximating the true probability distribution with the model [11] and
(iii) the Cross-Validation Heuristic, in which the model is selected by choosing the model with the
best expected performance when cross-validated against an independent set of data. To be clear,
the mathematical realization of each of these interpretations is identical.

The key to understanding the information-based approach is the appreciation that when models
predict new observations, they are parameterized not by the true parameter values, but by the MLE
parameters computed from previous observations. The failure of the MLE parameters to equal the
true parameters leads to information loss, or a degradation in the model predictivity, which grows
with model complexity.

Consider the cross entropy evaluated at the MLE parameters:

$$ H(\hat{\theta}_X, M) = \mathbb{E}_{Y \sim p}\, h(Y|\hat{\theta}_X, M). \qquad (5) $$

One can understand the cross entropy as a cross-validation of the model: the model is trained
against dataset $X^N$ and validated against dataset $Y$. Note that the expectation in the
definition of the cross entropy is taken over the true probability distribution $p$, which is
unknown, and therefore $H$ cannot be computed.

An estimator of the cross entropy $H$ evaluated at the MLE parameters is the information for
encoding dataset $X^N$ evaluated at the MLE parameters:

$$ \hat{H} = h(X^N|\hat{\theta}_X, M), \qquad (6) $$

which we will call the MLE information. (Note that whenever we discuss estimators of the cross
entropy, it will be implicit that these are estimators evaluated at the MLE parameters.) This
estimator is said to be biased since its expectation is not equal to the expectation of the cross
entropy. Let us define the bias as follows:

$$ \mathcal{K} \equiv \mathbb{E}_{X,Y \sim p} \left\{ h(Y^N|\hat{\theta}_X, M) - h(X^N|\hat{\theta}_X, M) \right\}, \qquad (7) $$

where $\mathcal{K}$ is the bias [12] or complexity. $\mathcal{K}$ is positive since it requires
more information on average to encode independent observations $Y^N$ using $\hat{\theta}_X$ than
observations $X^N$, as a consequence of fitting the noise in the training dataset $X^N$. An
unbiased estimator, $\tilde{H}$, for the cross entropy evaluated at the MLE parameters can then be
constructed:

$$ \mathrm{IC}(X^N, M) \equiv \tilde{H} = h(X^N|\hat{\theta}_X, M) + \mathcal{K}, \qquad (8) $$

where $\mathcal{K}$ can now be understood to be a penalty which penalizes the model complexity.
The estimator $\tilde{H}$ is said to be unbiased since its expectation is equal to the expectation
of the cross entropy by construction.
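
The bias of Eqn. 7 can be estimated by simulation whenever the generating distribution is known. The following is a minimal sketch under stated assumptions: the data are Gaussian with unit variance and the model has a single free parameter, the mean. The function names, sample sizes and seed are illustrative choices. For this particular model the expectation can be worked out by hand and equals one nat.

```python
import math
import random

def gaussian_info(xs, mu, sigma=1.0):
    """Total information (nats) of observations xs under a Gaussian model with mean mu."""
    n = len(xs)
    const = 0.5 * n * math.log(2.0 * math.pi * sigma ** 2)
    return const + sum((x - mu) ** 2 for x in xs) / (2.0 * sigma ** 2)

def estimate_complexity(n_obs=10, trials=4000, true_mu=0.0, seed=0):
    """Monte Carlo estimate of Eqn. 7 for a one-parameter model (unknown mean, known variance)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        xs = [rng.gauss(true_mu, 1.0) for _ in range(n_obs)]  # training data X^N
        ys = [rng.gauss(true_mu, 1.0) for _ in range(n_obs)]  # independent data Y^N
        mu_hat = sum(xs) / n_obs                               # MLE, fit to X^N only
        total += gaussian_info(ys, mu_hat) - gaussian_info(xs, mu_hat)
    return total / trials

complexity = estimate_complexity()
```

The estimate fluctuates around one nat regardless of `n_obs`, which is the behaviour expected of a regular model with a single continuous parameter.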

Although this approach would appear promising, there is a significant problem: we cannot compute
the complexity in general since the true distribution $p$ is unknown. Because we will introduce
more than one approximation for the value of the true complexity $\mathcal{K}$, we will adopt the
convention that when $\mathcal{K}$ appears with a subscript, it is some particular approximation
for the complexity, whereas when it appears without a subscript, it should be understood as the
true complexity, the bias computed with respect to the true but unknown probability distribution
$p$.

The information criterion is extremely powerful in that it can be used to compare two distinct
models, regardless of differences in the model parameterization. The model with the smallest IC
value is expected to have smaller cross entropy and therefore result in greater predictivity [2].

The Akaike Information Criterion. We will discuss two different approximations for computing the
complexity. The first of these is the method originally described by Akaike, which gives rise to
the Akaike Information Criterion (AIC). Akaike's great insight was to realize that, although the
true distribution $p$ might be unknown, for a large number of observations and a regular model the
complexity is:

$$ \mathcal{K}_{\rm AIC} = \dim(\theta) = K, \qquad (9) $$

where $K$ is the number of continuous parameters $\theta$ in the model $M$ and is often referred
to as either the degrees of freedom or the dimension of the model. Substituting the complexity into
the definition of the information criterion results in the canonical Akaike Information Criterion
(AIC):

$$ \mathrm{AIC}(X^N, M_K) \equiv h(X^N|\hat{\theta}_X, M_K) + K, \qquad (10) $$

where this expression is written in units of nats [13]. AIC is the unbiased estimator of the cross
entropy or, equivalently, the average information for encoding $N$ new observations in a model
parameterized by the MLE parameters. Although AIC model selection is successful in many problems,
it also fails in some important contexts.

Unidentifiable parameters. The reason for the failure of AIC is clear: a key assumption in the AIC
derivation is that the MLE parameters are asymptotically normally distributed about their true
values. Clearly this approximation can fail, especially at finite $N$. The precision with which a
parameter is determined by the data is set by $N I$, where $I$ is the Fisher Information. For
finite $N$, eigenvalues of $N I$ can become small, resulting in a poorly specified MLE and a
failure of the Laplace approximation.

The Frequentist Information Criterion. In analogy to AIC, FIC is an approximation for the true
complexity which is more generically applicable than the AIC approximation. Consider the true
complexity for the model $p = q(\cdot|\theta, M_K)$:

$$ \mathcal{K}_{\rm FIC}(\theta, M_K) \equiv \mathbb{E}_{X,Y \sim q(\cdot|\theta, M_K)} \left\{ h(Y^N|\hat{\theta}_X, M_K) - h(X^N|\hat{\theta}_X, M_K) \right\}, \qquad (11) $$

where we have written the complexity as a functional of the true parameters $\theta$ and the model
complexity index $K$. To construct the model selection criterion, we use the true complexity for
$q$ ($\mathcal{K}_{\rm FIC}$) to construct an approximately unbiased estimator of the predictive
information for $p$, which we will call the Frequentist Information Criterion in analogy to AIC:

$$ \mathrm{FIC} \equiv h(X^N|\hat{\theta}_X, M_K) + \mathcal{K}_{\rm FIC}(\hat{\theta}_X, M_K), \qquad (12) $$

where the complexity is evaluated at the MLE parameters. (Note that the nature of the
approximation is as follows: we assume that the true complexity for $p$ is well approximated by
the complexity for $q$ evaluated at the MLE values.) The model that minimizes FIC has the smallest
expected predictive information and the largest expected predictivity.

An analytic approach to computing the FIC complexity. We develop and motivate this approximation
elsewhere. Consider the difference in the complexity on the addition of a set of nested
parameters. Note that instead of representing model complexity with the complexity index $K$, it
is now convenient to use the nesting index $n$ since in general the nesting procedure will
increase the complexity index $K$ by an increment larger than one. We will therefore represent the
model more abstractly as $M_n$, where the index $n$ specifies the number of nesting levels.

Let $M_{n-1}$ be the $(n-1)$th nested model and $M_n$ be the $n$th nested model. The complexity
difference between the models can then be written:

$$ k_n \equiv \mathcal{K}_n - \mathcal{K}_{n-1}, \qquad (13) $$

which we will call the nesting complexity ($k$). The complexity can be re-summed:

$$ \mathcal{K}_{\rm FIC}(n) = \sum_{i=0}^{n} k_i, \qquad (14) $$

where the first term in the series, $k_0$, is defined by the direct computation of the complexity
from the parent model before nesting. This calculation is typically performed using the AIC
expression for the complexity.


We exploit the following piecewise approximation for evaluating the nesting complexity for
arbitrary parameter values. Let the observed change in the MLE information for the $n$th nesting
be

$$ \Delta h_n \equiv h(X^N|\hat{\theta}_X, M_n) - h(X^N|\hat{\theta}_X, M_{n-1}), \qquad (15) $$

where $n$ denotes the $n$th nesting of model $M$. (Note that the two instances of
$\hat{\theta}_X$ correspond to distinct parameter sets since they parameterize different models.)
We define the piecewise approximation of the nesting complexity:

$$ k_n = \begin{cases} k_-, & -\Delta h_n < k_- \\ k_+, & \text{otherwise} \end{cases} \qquad (16) $$

where the complexity is implicitly dependent on $\Delta h_n$.

When the new parameters are identifiable ($-\Delta h_n > k_-$), the nesting complexity is
approximated by the AIC nesting complexity:

$$ k_+ = \Delta K, \qquad (17) $$

where $\Delta K$ is the number of harmonic parameters added to the model in the nesting procedure.
When the parameters are unidentifiable ($-\Delta h_n < k_-$), the nesting complexity is the
expectation of the extremum of $m$ chi-squared random variables, each with $d$ degrees of freedom:

$$ k_- = \mathbb{E} \max_{1 \le i \le m} \chi^2_d(i) \qquad (18) $$

$$ \approx 2 \log m + O(\log \log m). \qquad (19) $$

The dimension $d$ is the number of harmonic degrees of freedom associated with the unidentifiable
parameters and $m$ is the number of distinguishable models, which can often be deduced from
context (as discussed below) or can be derived more rigorously.
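
The size of the expected extremum in Eqn. 18 can be checked numerically. The sketch below is a plain Monte Carlo estimate written for illustration; the function name, trial count and seed are arbitrary assumptions. It prints the estimate next to the leading term $2 \log m$ of Eqn. 19.

```python
import math
import random

def expected_max_chi2(m, d=1, trials=2000, seed=0):
    """Monte Carlo estimate of E[max over m independent chi-squared_d variables] (Eqn. 18)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Each chi-squared_d draw is a sum of d squared standard normal variables.
        total += max(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)) for _ in range(m))
    return total / trials

for m in (10, 100):
    print(m, round(expected_max_chi2(m), 2), round(2.0 * math.log(m), 2))
```

At these modest values of $m$ the estimate sits below $2 \log m$, which is consistent with a subleading correction of order $\log \log m$ that is not yet negligible.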

The Bayesian Information Criterion. Before finishing the preliminaries, we introduce the so-called
Bayesian Information Criterion (BIC). Despite its name and a similar mathematical form to AIC, BIC
is motivated by Bayesian statistics rather than information-based arguments. In Bayesian
statistics the optimal model maximizes the marginal probability. BIC is an approximation of the
minus log marginal probability and is defined:

$$ \mathrm{BIC}(X^N, M) \equiv h(X^N|\hat{\theta}_X, M) + \tfrac{1}{2} K \log N, \qquad (20) $$

where $K$ is again the number of model parameters and $N$ is the number of observations. Like AIC,
BIC is an asymptotic result for large $N$ and therefore it is clear that the BIC complexity (which
scales like $\log N$) is significantly larger than the AIC complexity, resulting in "smaller"
models (models with fewer parameters). Like AIC, BIC appears to be independent of the mathematical
details of the model and independent of the prior (which is required for a Bayesian approach). In
fact the contribution of the prior is assumed to be of order $N^0$ and can therefore be ignored in
the large $N$ limit.
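
Eqns. 10 and 20 differ only in the penalty, so the two criteria are simple to compare side by side. In the sketch below the MLE informations and parameter counts are hypothetical numbers chosen to show a case where the two criteria disagree; both functions work in nats, so conventional tabulated values would be twice as large [13].

```python
import math

def aic_nats(h_mle, k):
    """AIC in nats (Eqn. 10): MLE information plus one nat per parameter."""
    return h_mle + k

def bic_nats(h_mle, k, n_obs):
    """BIC in nats (Eqn. 20): MLE information plus (K/2) log N."""
    return h_mle + 0.5 * k * math.log(n_obs)

# Hypothetical MLE informations for a small and a large model fit to N = 100 observations.
h_small, k_small = 152.0, 3
h_large, k_large = 148.5, 5
prefers_large_aic = aic_nats(h_large, k_large) < aic_nats(h_small, k_small)
prefers_large_bic = bic_nats(h_large, k_large, 100) < bic_nats(h_small, k_small, 100)
```

With these numbers AIC accepts the two extra parameters while BIC rejects them, illustrating the tendency of BIC toward smaller models.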
III. APPLICATION: SEASONAL DEPENDENCE OF THE
NEUTRINO INTENSITY


In this section we have two principal aims: (i) to present a model selection analysis using AIC,
BIC and FIC and (ii) to demonstrate the dependence of the FIC complexity on the model encoding
algorithm. We present a model of simulated data inspired by the measurements of the seasonal
dependence of the neutrino intensity detected at Super-Kamiokande. This will be a toy model in the
sense that we will idealize and simplify the analysis. In particular, we will (i) bin the data
into 100 bins instead of analyzing time-resolved events and assume that the mean neutrino
intensity is (ii) smoothly varying in time, (iii) periodic with a period equal to one year and
(iv) Gaussian distributed in the event number with equal variance in all bins. To be clear, these
are matters of mathematical convenience rather than necessity.

One possible approach to modeling the data is to simply provide a list of $N = 365$ parameters,
one mean $\mu_j$ for each day. The problem with this model encoding is that we know that the
probability distribution for the intensity does not vary significantly with daily resolution. As a
result, the proposed model encoding will have imprecise parameterization and therefore poor
predictivity. We therefore propose to expand the mean, as a function of time, as a Fourier series.
This choice is not unique but the Fourier series has convenient mathematical properties and can
efficiently represent smooth periodic functions.


Simulated data. Out of respect for the complexity of true experimental data, we will choose a true
mean intensity dependence on the discrete-time index $j$ that cannot be represented by a finite
number of Fourier coefficients:

$$ \mu_j = \sqrt{120 + 100 \sin(2\pi j/N + \pi/6)}\ \mathrm{AU}, \qquad (21) $$

where the variance is $\sigma^2 = 1\ \mathrm{AU}^2$ and the data has been binned into $N = 100$
bins. The generating model, simulated data and two model fits are shown in Figure 1, Panel A.

Analysis of the data. We expand the model mean ($\mu_j$) and observed intensity ($x_j$) in Fourier
coefficients $M_i$ and $X_i$ respectively. (The details of the model representation are discussed
in the Appendix, Section A1.) The MLE parameters that minimize the information are
$\hat{M}_i = X_i$. We now introduce two different approaches to encoding our low-level model
parameters $\{M_i\}_{i=-N/2 \ldots N/2}$: the Sequential and Greedy Algorithms. Note that in both
cases, the models will be represented by non-zero subsets of the same underlying model
parameters, the Fourier coefficients ($M_i$).

Sequential-Algorithm Model. In the Sequential Algorithm we will represent our nested-parameter
vector as follows:

$$ \theta_n = \begin{pmatrix} & M_{-1} & \cdots & M_{-n} \\ M_0 & M_1 & \cdots & M_n \end{pmatrix}, \qquad (22) $$

where all selected $M_i$ are set to their respective MLE values and all other $M_i$ are
identically zero. We initialize the encoding algorithm by encoding the data with parameters
$\theta_0$. We then execute a sequential nesting procedure, increasing temporal resolution by
adding the Fourier coefficients $M_{\pm i}$ corresponding to the next smallest integer frequency
index $i$, in sequential order. (Note that there are two Fourier coefficients at every frequency,
labeled $\pm i$, except at $i = 0$.) The cutoff frequency is indexed by $n$ and is determined by
the model selection criterion.

AIC and FIC. From the AIC perspective the complexity is simply a matter of counting the continuous
parameters fit for each model as a function of the nesting index. Counting the parameters in
Eqn. 22 gives the expression for the complexity:

$$ \mathcal{K}_{\rm AIC} = 2n + 1, \qquad (23) $$

since both an $M_i$ and an $M_{-i}$ are added at every level. Since there is no ambiguity in the
MLE parameter values, FIC predicts the same complexity as AIC.

Bayes complexity. In the Bayesian analysis, we invoke the BIC result (a complexity of
$\tfrac{1}{2} \log N$ per degree of freedom). By an analogous argument to the AIC reasoning, the
complexity is therefore:

$$ \mathcal{K}_{\rm BIC} = \tfrac{1}{2}(2n + 1) \log N, \qquad (24) $$

where $N = 100$. This complexity is clearly significantly larger than the AIC complexity.

Greedy-Algorithm Model. In some contexts it may not make sense to start with the lowest frequency
terms and work sequentially towards higher frequency. An alternative approach would be to consider
all the Fourier coefficients and select the largest magnitude coefficients to construct the model.
In the Greedy Algorithm we will represent the Fourier coefficients as follows:

$$ \theta_n = \begin{pmatrix} 0 & i_1 & \cdots & i_n \\ M_0 & M_{i_1} & \cdots & M_{i_n} \end{pmatrix}, \qquad (25) $$

where the first row represents the Fourier index and the second row is the corresponding Fourier
coefficient. As before, all unspecified coefficients are set to zero. We initialize the encoding
algorithm by encoding the data with parameters $\theta_0$ and then we execute a sequential nesting
procedure: at each step in the nesting process, we choose the Fourier coefficient with the largest
magnitude (not already included in $\theta_{n-1}$). The optimal nesting cutoff will be determined
by model selection.

AIC complexity. To compute the AIC complexity, we again count the model parameters. One might be
tempted to set the complexity equal to the complexity for the Sequential Algorithm since there are
two parameters added in each nesting step. But one of these parameters is an integer index and is
therefore not expected to be harmonic [14]. Therefore we expect the complexity term to be

$$ \mathcal{K}_{\rm AIC} = n + 1, \qquad (26) $$

where $n$ is the nesting index.

FIC complexity. After the algorithm is initialized, each nesting step chooses the largest Fourier
coefficient; therefore the meaning of the index $i_j$ is unidentifiable when there are no
resolvable Fourier coefficients remaining. We define the complexity in terms of the nesting
penalties $k_\pm$. When a coefficient is identifiable, there is no ambiguity and we recover the
AIC result: $k_+ = 1$. When the next coefficient is not resolvable, we evaluate Eqn. 19 for the
nesting complexity with $m = N$, since each Fourier coefficient is independent, and with
chi-squared dimension $d = 1$, corresponding to the dimension of the added Fourier coefficient:

$$ k_- = 2 \log N + O(\log \log N). \qquad (27) $$

We assemble this piecewise complexity using Eqns. 16 and 14. In short, the initial slope of the
complexity $\mathcal{K}$ with respect to $n$ is $k_+$, transitioning to $k_-$ at the optimal
model size. (See Figure 1, Panel D.)
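
The assembly of the piecewise complexity can be sketched in a few lines. The function below is an illustrative reading of Eqns. 14 to 16, not the authors' implementation: the name `fic_complexity`, the list of MLE informations and the decision to keep only the leading term of Eqn. 27 are all assumptions.

```python
import math

def fic_complexity(h_mle, k_plus, k_minus, k0):
    """Cumulative complexity (Eqn. 14) built from the piecewise nesting complexity (Eqn. 16).

    h_mle[n] is the MLE information of the nth nested model; k0 is the parent-model complexity.
    """
    complexity = [k0]
    for n in range(1, len(h_mle)):
        gain = -(h_mle[n] - h_mle[n - 1])            # -Delta h_n, Eqn. 15
        k_n = k_minus if gain < k_minus else k_plus  # Eqn. 16
        complexity.append(complexity[-1] + k_n)
    return complexity

n_bins = 100
k_minus = 2.0 * math.log(n_bins)         # leading term of Eqn. 27; the log log N correction is dropped
h_mle = [100.0, 60.0, 40.0, 38.0, 37.0]  # hypothetical MLE informations, not taken from the paper
complexity = fic_complexity(h_mle, k_plus=1.0, k_minus=k_minus, k0=1.0)
fic = [h + k for h, k in zip(h_mle, complexity)]
best_n = min(range(len(fic)), key=fic.__getitem__)
```

In this made-up example the first two nesting steps reduce the information by far more than $k_-$ and are charged one nat each, while the later steps are charged $2 \log N$ each, so FIC is minimized at the last resolvable coefficient.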

Bayes complexity. By similar arguments to the AIC analysis, we expect a single BIC-like
contribution from each Fourier coefficient $M_{i_j}$. For the integer index $i_j$, the most
sensible uninformative prior to give is $p = N^{-1}$ since the index can take any one of $N - n$
values. We therefore expect the complexity to be

$$ \mathcal{K}_{\rm BIC} = \underbrace{\tfrac{1}{2}(1 + n) \log N}_{M_{i_j}} + \underbrace{n \log N}_{i_j}, \qquad (28) $$

where $n$ is the nesting index and $N$ is the number of bins (observations). We have assumed
$N \gg n$, and the sources of the contributions to the complexity are shown explicitly.

Visualization of FIC model selection. In Figure 1, Panel A, the true mean (green), simulated data
(green points) and the Sequential (red) and Greedy-Algorithm Models (blue) are shown.
Qualitatively, it is clear that the Sequential-Algorithm Model (red) results in a better
approximation of the true model (green) than the Greedy-Algorithm Model (blue). In Panel B, we
show the magnitude of the Fourier coefficients as a function of the frequency index $i$ for the
Sequential-Algorithm Model. The red dotted line represents the model selection cutoff, which
corresponds to the index where the true model coefficients (green points) begin to diverge
significantly from the fit model coefficients (red points). The FIC model selection criterion
correctly identifies this transition. In Panel C, the MLE information and FIC for the Sequential
(red) and Greedy-Algorithm Models (blue) are plotted as a function of the nesting index $n$. The
optimum model minimizes the estimated cross entropy (FIC). In this case both models happen to
have the same cutoff index, $n = 2$. Although the cutoff index is the same, the
Sequential-Algorithm Model encodes two Fourier coefficients per nesting level versus one
coefficient per nesting level in the Greedy-Algorithm Model. The slope of the information for the
Greedy-Algorithm Model (dashed blue) is clearly significantly more negative than the slope of the
Sequential-Algorithm Model (dashed red), indicative of a larger complexity. Although both models
have the same nesting cutoff, the Sequential-Algorithm Model results in a lower estimated FIC at
its minimum, and it is therefore the preferred model, matching our intuitive sense from comparing
the two models to the true model in Panel A.
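
The Sequential-Algorithm analysis can be imitated with a short simulation. The following is a minimal sketch, not the code used to produce Figure 1: it draws data from Eqn. 21 with unit variance, projects onto the Fourier basis of the Appendix and evaluates the FIC of Eqn. 12 with the complexity of Eqn. 23. The seed, the maximum nesting level and all function names are arbitrary assumptions, so the selected cutoff need not match the figure.

```python
import math
import random

def basis(i, j, n_bins):
    """Orthonormal Fourier basis of the Appendix: cosine for i < 0, constant for i = 0, sine for i > 0."""
    if i == 0:
        return 1.0 / math.sqrt(n_bins)
    trig = math.cos if i < 0 else math.sin
    return math.sqrt(2.0 / n_bins) * trig(2.0 * math.pi * i * j / n_bins)

def sequential_fic(xs, n_max, sigma=1.0):
    """MLE information and FIC (equal to AIC here, Eqn. 23) for nesting levels 0..n_max."""
    n_bins = len(xs)
    coeff = {i: sum(x * basis(i, j, n_bins) for j, x in enumerate(xs, start=1))
             for i in range(-n_max, n_max + 1)}
    const = 0.5 * n_bins * math.log(2.0 * math.pi * sigma ** 2)
    residual = sum(x * x for x in xs) - coeff[0] ** 2   # residual sum of squares at n = 0
    h_mle, fic = [], []
    for n in range(n_max + 1):
        if n > 0:
            residual -= coeff[n] ** 2 + coeff[-n] ** 2   # add the pair M_{+n}, M_{-n}
        h_mle.append(const + residual / (2.0 * sigma ** 2))
        fic.append(h_mle[-1] + 2 * n + 1)
    return h_mle, fic

rng = random.Random(1)                                   # arbitrary seed
n_bins = 100
mean = [math.sqrt(120.0 + 100.0 * math.sin(2.0 * math.pi * j / n_bins + math.pi / 6.0))
        for j in range(1, n_bins + 1)]                   # Eqn. 21
xs = [rng.gauss(m, 1.0) for m in mean]
h_mle, fic = sequential_fic(xs, n_max=10)
n_hat = min(range(len(fic)), key=fic.__getitem__)        # selected nesting index
```

The MLE information `h_mle` can only decrease as coefficients are added, whereas `fic` eventually rises once the remaining coefficients are dominated by noise; `n_hat` marks that turning point for the particular simulated dataset.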


In the context of a simulation, the true probability distribution is known. Therefore we can
compute the true complexity in both encoding algorithms as a function of nesting index and
compare it to the FIC approximation, which is made without knowledge of the true distribution.
(To make this distinction between models clear, we increase the number of bins to $N = 1000$ for
this calculation.) This comparison is shown in Figure 1, Panel D. The FIC complexity
($\mathcal{K}_{\rm FIC}$, solid line) is clearly a good approximation for the true complexity in
both models (points). For nesting indices significantly larger than the optimal index, the FIC
approximation for the complexity fails since the model assumed to compute the complexity is a
poor approximation for the true model. In the Sequential-Algorithm Model, the complexity is
AIC-like since the slope is independent of the number of observations $N$. In the
Greedy-Algorithm Model, the complexity is BIC-like since the slope is proportional to $\log N$.
(Note that Panel D shows a plot with respect to the nesting index $n$, not the number of
observations $N$.) In the Greedy-Algorithm Model, the transition in the complexity between the
AIC-like and BIC-like regimes can clearly be seen at the optimal nesting index $n = 4$, exactly
as predicted by Eqn. 16. In both cases, the complexity is correctly captured by FIC.


FIC vs AIC and BIC. In Figure 1 we show only the results of the FIC model selection. Both AIC and
BIC fail to predict the correct complexity scaling for one of the two algorithms. In the
Sequential Algorithm, AIC predicts the correct complexity and the BIC estimate of the complexity
is too pessimistic [15]. This situation may be tolerable for $N = 100$, but for very large $N$
the model selection criterion for BIC becomes extremely strict. Of course the situation is
reversed in the context of the Greedy Algorithm, where the AIC complexity is much too weak to lead
to model selection whereas the BIC result at least predicts the correct scaling with $N$, even if
the coefficient is incorrect. FIC by contrast accurately predicts the true complexity in both
scenarios.
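
The comparison can be made quantitative by tabulating the complexity added per nesting level under each criterion. The sketch below simply differentiates Eqns. 23, 24, 26 and 28 with respect to $n$ and uses the leading term of Eqn. 27 for FIC in the Greedy Algorithm; that FIC entry therefore applies only beyond the cutoff, where the coefficients are unresolvable, and neglects the $\log \log N$ correction. The function name is an illustrative choice.

```python
import math

def slopes_per_level(n_bins):
    """Leading-order complexity added per nesting level, in nats, for each criterion."""
    log_n = math.log(n_bins)
    return {
        "sequential": {"AIC": 2.0, "BIC": log_n, "FIC": 2.0},            # Eqns. 23, 24
        "greedy": {"AIC": 1.0, "BIC": 1.5 * log_n, "FIC": 2.0 * log_n},  # Eqns. 26, 28, 27
    }

for n_bins in (100, 1000):
    print(n_bins, slopes_per_level(n_bins))
```

In the Greedy Algorithm the BIC and FIC slopes share the $\log N$ dependence but differ in the coefficient, whereas the AIC slope does not grow with $N$ at all.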
IV. DISCUSSION

The model encoding algorithm determines the complexity. The neutrino analysis was purposefully
constructed to illustrate the importance of the encoding algorithm. The FIC complexity clearly
differentiates between the Sequential and Greedy Algorithms in spite of the fact that both
algorithms have the same low-level representation in terms of subsets of non-zero Fourier
coefficients [16]. Unlike the AIC formalism, FIC depends on the encoding algorithm of the model.
This dependence is explicit in the definition of the FIC complexity (Eqn. 11), as opposed to the
AIC approach, which is algorithm independent. We note that the FIC formalism allows more general
estimators than the MLE.

The presence of unidentifiable parameters determines complexity scaling. The key differentiator
between the two algorithms presented for the encoding of the neutrino data was the presence of an
unidentifiable parameter in the Greedy Algorithm (the frequency index $i_j$), which did not
generate a consistent MLE for small $N$. The complexity (equivalent to the parameter-encoding
information) is large as a consequence of the need to resolve this ambiguity. Unidentifiable
parameters arise as the consequence of near-zero eigenvalues of the Fisher Information Matrix
that result in inconsistent estimators for small $N$. In non-pathological models (models without
eigenvalues of the Fisher Information Matrix that are exactly zero), a sufficiently large number
of observations will lead any particular parameter to become identifiable.

The FIC complexity can exhibit AIC, BIC and more general scaling. The analysis of the neutrino
system gave examples of both canonical complexity scalings with observation number ($N$): in the
Sequential Algorithm the complexity is clearly AIC-like ($N^0$). In the Greedy Algorithm the
complexity has a BIC-like scaling ($\log N$). Motivated by this example, one might hypothesize
that the scaling is always either AIC-like or BIC-like. We offer a counterexample: the
Change-Point Algorithm. In this example, the number of independent models scales like $\log N$
and therefore the complexity scales like

$$ k_- \approx 2 \log \log N. \qquad (29) $$

We present a detailed analysis of this problem elsewhere.

Conclusion. In this paper we have intentionally presented a simplified application of the
Frequentist Information Criterion (FIC) to demonstrate an analytically tractable example. In more
complex applications, the complexity should be computed numerically. In particular, we have
chosen an example where the complexity depends on the parameters $\theta$ in a trivial way, but
this is a special case. More generally the complexity must be computed for all parameter values
of interest.

Unlike AIC, FIC is widely applicable since it accurately approximates the complexity in both
regular and singular models. In contrast to the Bayesian approach, no ad hoc prior need be
specified explicitly or implicitly (as is the case in BIC). Furthermore, while FIC can be
understood as equivalent to a frequentist approach, there is no need to specify a null
hypothesis, statistic, test or confidence level as is typically the case in Frequentist
inference. The FIC approach is therefore free of many of the perceived shortcomings of both
Bayesian and Frequentist approaches to inference and is more generally applicable than previously
proposed information-based approaches to inference.

dependent observations (or trials) adds while their probabilities multiply. The typical performance of a model
should therefore be understood as either the geometric 
average of the probability or equivalently the arithmetic 
average of the negative information (a.k.a. the negative 
cross entropy). We give a more detailed explanation in 
Ref. 0. 

[11] In the appendix, Section ?? we discuss (ii) in the context
of the Kullback-Leibler Divergence.

FIG. 1: Model selection on simulated neutrino data. Panel A: Truth, data and models. The true mean intensity is plotted (solid
green) as a function of season, along with the simulated observations (green points) and models encoded using two different
algorithms: Sequential (red) and Greedy (blue). The Sequential Algorithm results in a significantly better fit to the observed data.
Panel B: Fourier coefficient magnitudes. The magnitude of the Fourier coefficients $C_j$ is plotted as a function of frequency index
$j$ for the Sequential-Algorithm Model. Below the cutoff (dotted red line), there is qualitative agreement between the true values
(green points) and the fit coefficients (red points). The model selection criterion correctly identifies the transition from average
information loss to gain, as illustrated by the widely divergent true and fit coefficients for $j > 3$. Panel C: Encoding information.
The encoding information is plotted as a function of the nesting index $n$. The true information is compared with the information
for the Sequential (red) and Greedy (blue) Algorithm Models. The dashed curves represent the information as a function of nesting
index and both are monotonically decreasing. The solid curves (red and blue) represent the estimated average information (FIC),
which is equivalent to the estimated model predictivity. The model selection criterion chooses the model size (nesting index) that
minimizes FIC. Panel D: The true complexity matches FIC estimates. (Simulated for $N = 1000$.) In the Sequential-Algorithm
Model, the true complexity (red dots) is AIC-like (solid red). In the Greedy-Algorithm Model, the true complexity (blue dots)
transitions from AIC-like (slope $= 1$) to BIC-like (slope $\propto \log N$) at the cutoff nesting index $n = 4$. In both cases, the true
complexity is correctly predicted by FIC (solid curve).


[12] We have defined the bias with the opposite sign to what 
is common practice in order that there be no distinction 
between the sign of the complexity, to be defined, and the 
bias. 

[13] Historically information criteria are usually written in 
units of demi-nats, resulting in a numerical expression 
that is twice the definition we give. 

[14] Only parameters on which the information has an approximately
quadratic dependence contribute.

[15] The BIC complexity is said to be incorrect since it leads to 
significant information loss compared with the optimal 
model cutoff. 

[16] Clearly the models are parameterized by different non-zero
subsets of the $M_i$.

Appendix A: Additional Applications

In each of the following applications we will assume we are analyzing intensity measurements associated with
some degree of freedom, indexed by the discrete index $j$. The model for the intensity will be a Gaussian distribution where the mean
intensity depends on $j$ but the variance is constant and is assumed to be known. We will write the intensity as $x_j$ for
consistency with the derivations described in the results section. The model probability distribution can therefore be
written:

$$ q(x_j|\theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -(x_j - \mu_j)^2 / 2\sigma^2 \right], \qquad (A1) $$

where the mean intensities $\mu_j$ are encoded by the model parameters $\theta$. In each of the two examples below, we will
discuss different encodings relevant for different experimental scenarios. In each case the complexity term will have
a different form due to the differences in model encodings. We will assume that $N$ is large since this enables us to
invoke some analytic approximations.

Before we continue, we note that the toy models discussed here are described as simply as possible to make a
point about the encoding and the complexity. The fact that we will represent time and energy as a discrete index
is of no significance. It is straightforward to treat time (or energy) resolved data by likelihood-based techniques.
Furthermore, the fact that we use a Gaussian distribution instead of a more general distribution is a computational
convenience, no more. The same is true of the large $N$ limit. In principle one can use the same techniques for any
number of observations. Finally, as mentioned before, we will assume the variance is known. Again, this assumption
is not required and is rather a computational convenience.

Data-encoding information. In all cases below, the data-encoding information, obtained by substituting the model
pdf (Eqn. A1) into the definition of the data-encoding information (Eqn. ??), can be written as follows:

$$ h(X|\theta) = \frac{N}{2} \log 2\pi\sigma^2 + \frac{1}{2\sigma^2} \sum_{j=1}^{N} (x_j - \mu_j)^2, \qquad (A2) $$

where we shall assume throughout that $\mu_j$, the mean intensity, is parameterized by model parameters $\theta$ and the
variance $\sigma^2$ is a known parameter.


1. Details: Seasonal dependence of the neutrino intensity 


Analysis of the data. We expand the model mean ($\mu_j$) and observed intensity ($x_j$) into Fourier coefficients $M_i$ and
$X_i$ respectively:

$$ \mu_j = \sum_{i=-N/2}^{N/2} M_i \psi_i(j) \quad \text{where} \quad M_i = \sum_{j=1}^{N} \mu_j \psi_i(j), \qquad (A3) $$

$$ x_j = \sum_{i=-N/2}^{N/2} X_i \psi_i(j) \quad \text{where} \quad X_i = \sum_{j=1}^{N} x_j \psi_i(j), \qquad (A4) $$

where the orthonormal Fourier basis functions are defined:

$$ \psi_i(j) = N^{-1/2} \begin{cases} \sqrt{2} \cos(2\pi i j/N), & i < 0 \\ 1, & i = 0 \\ \sqrt{2} \sin(2\pi i j/N), & i > 0. \end{cases} \qquad (A5) $$

Substituting these expressions into the expression of the data-encoding information gives

$$ h(X|\theta) = \frac{N}{2} \log 2\pi\sigma^2 + \frac{1}{2\sigma^2} \sum_{i=-N/2}^{N/2} (X_i - M_i)^2, \qquad (A6) $$

where we have used the orthogonality in the large $N$ limit for all terms. We chose the eigenfunction normalization
in order to give this expression its concise form, analogous to Eqn. A2.
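
The orthonormality used in passing from Eqn. A2 to Eqn. A6 can be verified numerically on a small grid. The sketch below is an illustrative check only; the grid size and function names are arbitrary assumptions. The edge frequencies $|i| = N/2$ are left out of the check because $\sin(\pi j)$ vanishes identically on the integer grid, which is one reason the statement in the text is made in the large $N$ limit.

```python
import math

def psi(i, j, n_bins):
    """Basis function of Eqn. A5 evaluated at bin j = 1..N."""
    if i == 0:
        return 1.0 / math.sqrt(n_bins)
    trig = math.cos if i < 0 else math.sin
    return math.sqrt(2.0 / n_bins) * trig(2.0 * math.pi * i * j / n_bins)

def inner(a, b, n_bins):
    """Discrete inner product of basis functions a and b over the N bins."""
    return sum(psi(a, j, n_bins) * psi(b, j, n_bins) for j in range(1, n_bins + 1))

n_bins = 16
indices = range(-n_bins // 2 + 1, n_bins // 2)   # interior frequencies only
max_error = max(abs(inner(a, b, n_bins) - (1.0 if a == b else 0.0))
                for a in indices for b in indices)
```

A value of `max_error` at the level of floating-point rounding confirms that the interior basis functions are orthonormal, so the sum of squared residuals in bin space equals the sum of squared coefficient differences for any model built from them.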





